A paired-end whole-genome sequencing approach enables comprehensive characterization of transgene integration in rice

Efficient, accurate molecular characterization of genetically modified (GM) organisms is challenging, especially for transgenic events that carry genes or elements of the recipient species. Herein, we report the comprehensive molecular characterization of a novel GM rice event, G281, which was transformed with native promoters and an RNA interference (RNAi) expression cassette, using paired-end whole genome sequencing (PE-WGS) and a modified TranSeq approach. Our results show that the transgenes integrated at position 16,439,674 on rice chromosome 3, accompanied by a 36 bp deletion of rice genomic DNA, and that the whole integration contains two copies of the complete transfer DNA (T-DNA) in a head-to-head arrangement. No unintended insertion or backbone sequence of the transformed plasmid was observed at the whole genome level. Molecular characterization of the G281 event will assist risk assessment and the application for a commercial license. In addition, we speculate that our approach could be further used to identify transgene integration in cisgenesis/intragenesis crops, since both ends of the T-DNA in G281 rice are derived from native genes or elements, which is similar to cisgenesis/intragenesis. Our results from an in silico mimic of a cisgenesis event confirm that the mimic rice Gt1 gene insertion and its flanking sequences are successfully identified, demonstrating the applicability of PE-WGS for molecular characterization of cisgenesis/intragenesis crops.

Reviewer #1 (Remarks to the Author):
Strengths of the paper:
-The authors laid out a clear and precise approach to identifying the exact location of a transgene using paired-end whole genome sequencing.
-Coupling PE-WGS with PCR and droplet digital PCR ensured precision in identifying and locating the transgene on rice chromosome 3.
Major concern: The basis of the paper's strength is not factual
-The authors have based the strength of their approach on its effectiveness in identifying an intragene/cisgene in GM rice event G281. They assert that this line is "similar to" cisgenic/intragenic. However, this rice line is not cisgenic/intragenic and the argument of it being "similar to" cisgenic/intragenic is not a logical one. This assertion is obviously misleading. Authors should have used an actual cisgenic/intragenic line if they wanted to front this argument as the paper's strength. Authors should therefore clearly indicate that GM rice event G281 is transgenic and change this in the running title, abstract and within the text and have the paper evaluated on that basis. The GM rice event G281 has genes from maize, human and Pseudomonas putida, making it transgenic.
-In their own words, the authors agree that investigating intragenesis is more complex than line G281 (which is essentially transgenic) -lines 343-345.
Other concerns
-In the materials and methods (lines 143-146), authors indicate that they filtered false-positive reads by searching for positions of similar sequences between plasmids and references using BLASTN. However, they did not indicate the results of this comparison in the results section. This is important since some plasmid sequences which authors considered false positives were observed in the transgenic and non-transgenic wild type (lines 315-320 and 348-350).
-Authors indicate that the sequences homologous to transgenic ones in the wild type "…must be derived from homogenous sequences between the host genome and the transformed plasmid, sequencing bias, or sequencing data contamination" -Lines 348-355. If this approach is to be solid enough, authors need to be sure where these sequences came from rather than guess.
Reviewer #2 (Remarks to the Author):
Overall, the manuscript is in good shape. The language is adequate and easy to read. Although its novelty is not very strong, the authors demonstrated that PE-WGS is a useful tool for molecular characterization analysis of cisgenesis/intragenesis crops. This is valuable to the relevant scientific community. In order to make the conclusion more reasonable, the authors should make some revisions and explanations.
1) For 3.2, how were the 69 read pairs that were retained as candidate reads for determining the T-DNA insertion site distributed? How many were on the 3' junction region, and how many on the 5' junction region? How was it determined that the sporadic read pairs mapped to other chromosomes might be false positive reads? It should be proved by experiment.
2) For 3.3, copy number of exogenous DNA. Only the copy number of the foreign genes G6-EPSPS and hlf was analyzed. Please add how the copy number of the RNAi of the rice CYP81A6 gene was analyzed, because the purpose of this article is the use of PE-WGS for molecular characterization analysis of cisgenesis/intragenesis crops.
3) There should be an IGV analysis of the 69 read pairs that were retained as candidate reads for determining the T-DNA insertion site against the T-DNA sequence.
4) How can it be known from Figure 4 that there is no backbone insertion?
5) DNA extraction in 2.1 should be put at 2.2 part.
6) Please revise some mistakes such as: (1) Line 116, xiushui-110, Line 117 xiushui 110, please unify throughout the text (2) Line 115, gene should be orthographic (3) Line 126, paired-end whole-genome sequencing should be deleted.

Point-by-point responses to the Reviewers' comments
Responses to comments from Reviewer #1
Reviewer #1 (Remarks to the Author):
Strengths of the paper:
-The authors laid out a clear and precise approach to identifying the exact location of a transgene using paired-end whole genome sequencing.
-Coupling PE-WGS with PCR and droplet digital PCR ensured precision in identifying and locating the transgene on rice chromosome 3.
Comment 1: Major concern: The basis of the paper's strength is not factual
-The authors have based the strength of their approach on its effectiveness in identifying an intragene/cisgene in GM rice event G281. They assert that this line is "similar to" cisgenic/intragenic. However, this rice line is not cisgenic/intragenic and the argument of it being "similar to" cisgenic/intragenic is not a logical one. This assertion is obviously misleading.
Authors should have used an actual cisgenic/intragenic line if they wanted to front this argument as the paper's strength. Authors should therefore clearly indicate that GM rice event G281 is transgenic and change this in the running title, abstract and within the text and have the paper evaluated on that basis. The GM rice event G281 has genes from maize, human and Pseudomonas putida, making it transgenic.
Answer: Thank you very much for pointing out the inaccuracies in our description of the G281 rice event. In the section "Plant material", we give detailed information on transgenic rice event G281, which was produced by inserting a T-DNA containing three expression cassettes (hLF, G6 EPSPS, and the RNAi of the rice CYP81A6 gene) into the non-GM line Xiushui-110 via Agrobacterium-mediated transformation. Since elements and exogenous genes from other species are present in the T-DNA region, G281 rice is transgenic. By the definition of a cisgenic/intragenic line, all elements should come from the native genome; G281 rice is therefore not a real cisgenic/intragenic line. We have clearly indicated that GM rice event G281 is transgenic and have changed the running title and the text throughout the MS accordingly.
In our work, we aim to explore the molecular characterization of the transgenic rice G281 event, which was produced by introducing T-DNA. The T-DNA contains three gene expression cassettes in the following order: the hLF expression cassette, the EPSPS expression cassette, and the RNA interference cassette. Initially, we failed to identify the insertion sites and flanking sequences using our previously developed TranSeq pipeline. We then found that the DNA sequences adjoining the LB and RB at both ends of the T-DNA were from the rice recipient genome (the rice Gt1 promoter and the RNAi of the CYP81A6 gene), which is why the TranSeq analysis failed. We therefore modified and updated the TranSeq pipeline to remove the background noise from the native rice Gt1 promoter and CYP81A6 gene sequences, and successfully found the T-DNA insertion sites and flanking sequences. By the definition of cisgenesis/intragenesis, only DNA from the native genome or from a cross-compatible species is transferred into the recipient genome, and this DNA is normally integrated at a locus other than its native one. To identify the insertion site and flanking sequences of native DNA in cisgenesis/intragenesis, the strategy is essentially to find the junctions between the native DNA and the recipient genomic DNA and to eliminate the background noise from native DNA in the recipient genome, which is quite similar to the analysis of the G281 rice event (as shown in the following figure). Therefore, we believe that the modified pipeline could be used to identify the insertion sites and flanking sequences of native DNA in cisgenesis/intragenesis.
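The background-removal idea described in this answer can be illustrated with a minimal Python sketch. It is not the modified TranSeq implementation. It assumes that each candidate read pair has already been reduced to the genomic position of its host-mapped mate, and that BLASTN has already produced the host intervals homologous to the plasmid. The names `Interval`, `ReadPair`, `overlaps_homolog` and `filter_background`, the `margin` parameter, and every coordinate except the chromosome 3 insertion locus 16,439,674 are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


class ReadPair(NamedTuple):
    name: str
    chrom: str  # chromosome of the mate that maps to the host genome
    pos: int    # 1-based mapping position of that mate


def overlaps_homolog(pair: ReadPair, homologs: List[Interval], margin: int) -> bool:
    """True if the host-mapped mate lies in, or within `margin` bp of, a homologous interval."""
    return any(
        pair.chrom == iv.chrom and iv.start - margin <= pair.pos <= iv.end + margin
        for iv in homologs
    )


def filter_background(
    pairs: List[ReadPair], homologs: List[Interval], margin: int = 500
) -> Tuple[List[ReadPair], List[ReadPair]]:
    """Split candidates into (kept, removed); removed pairs sit at endogenous homologous loci."""
    kept, removed = [], []
    for pair in pairs:
        (removed if overlaps_homolog(pair, homologs, margin) else kept).append(pair)
    return kept, removed


# Placeholder intervals only: NOT the real endogenous Gt1 / CYP81A6 coordinates.
HOMOLOGS = [
    Interval("chr1", 100_000, 102_000),
    Interval("chr2", 250_000, 251_500),
]

# Placeholder candidates; chr3 positions are chosen near the reported locus 16,439,674.
CANDIDATES = [
    ReadPair("pair_a", "chr3", 16_439_650),
    ReadPair("pair_b", "chr1", 100_900),   # inside the first placeholder interval
    ReadPair("pair_c", "chr2", 251_800),   # within the margin of the second interval
    ReadPair("pair_d", "chr3", 16_439_700),
]

kept, removed = filter_background(CANDIDATES, HOMOLOGS)
print("kept:", [p.name for p in kept])
print("removed:", [p.name for p in removed])
```

In this toy input only the two pairs near the chromosome 3 locus survive, which mirrors the stated goal of narrowing the candidate set to genuine junction-supporting pairs. The `margin` value is an illustrative choice, not a parameter taken from the manuscript.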
In a cisgenic/intragenic line, the DNA sequences at the 5' and 3' ends of the T-DNA are both from the native genome. In the G281 line, the DNA sequences at the 5' and 3' ends of the T-DNA (rice Gt1 promoter and RNAi of CYP81A6 gene) are also from the rice genome. This is why we used the words "similar to". Following your comment, we now recognize that "similar to" is not rigorous. In the new revision, we have removed any mention of cisgenesis/intragenesis in the title and the main text that may mislead readers into thinking G281 is a cisgenesis/intragenesis line.
Following your comment, we contacted many collaborators who breed new crop lines with new breeding techniques, but we failed to find a true cisgenesis/intragenesis line. Although we do not have a real cisgenesis/intragenesis line, in the revised version we tested the approach using simulated NGS data generated randomly from a mimic rice line carrying a native Gt1 gene insertion. The results showed that the mimic Gt1 gene insertion could be identified, together with its detailed insertion site, flanking sequences, and copy number (Table 4; Supplemental File 1, 4; Supplemental Figure S1, S3, and S4; Supplemental Table S1 and S5). We believe these results help to confirm that our pipelines are, in theory, useful for the analysis of cisgenesis/intragenesis.
Nevertheless, one point we are trying to make in both the original and the revised MS is that some of the analyses we performed in the G281 study could potentially be useful in analyzing a true cisgenesis/intragenesis line. We agree with you that we did not clearly state that G281 is not a cisgenesis/intragenesis line. We hope you will find that the revised MS pays close attention to this concern and that the revised text makes our point more accurately.
Comment 2: -In their own words, the authors agree that investigating intragenesis is more complex than line G281 (which is essentially transgenic) -lines 343-345.
Answer: In general, full molecular characterization of a transgenic crop comprises the exogenous DNA insertion site, the flanking sequences, the arrangement and sequence of the whole inserted exogenous DNA, and the copy number of the inserted DNA. Our pipelines have the potential to identify the insertion site and flanking sequences of native DNA in cisgenesis/intragenesis, but obtaining the arrangement and sequence of the whole insertion in cisgenesis/intragenesis with our modified pipelines remains challenging. That is why we said in the manuscript that full molecular characterization of cisgenesis/intragenesis is more complex than that of line G281.

Other concerns
Comment 3:-In the materials and methods (lines 143 -146), authors indicate that they filtered false-positive reads by searching for positions of similar sequences between plasmids and references using BLASTN. However, they did not indicate the results of this comparison in the results section. This is important since some plasmid sequences which authors considered false positives were observed in the transgenic and non-transgenic wild type (lines 315-320 and 348 -350).
Answer: Thanks for your kind comment. As shown in Table 1, searching for positions of similar sequences between the plasmid and the reference using BLASTN in the modified pipeline filtered out 257 of the 262 candidate reads in the non-transgenic wild type and 253 of the 322 candidate reads in transgenic G281. These results indicate that most false positive reads can be effectively removed, which greatly helps in identifying the real insertion site and flanking sequences and reduces the amount of further experimental confirmation needed. We have also added the details of the comparison to Table 2 in the revised version.
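As a quick arithmetic check on the counts quoted in this answer, the short Python sketch below derives how many candidates remain after filtering. The variable and function names are illustrative only, and the answer reports these counts as reads whereas a later answer refers to read pairs.

```python
# Counts quoted in the answer above: (candidate reads, reads removed by the BLASTN-based filter).
COUNTS = {
    "wild type": (262, 257),
    "G281": (322, 253),
}


def summarize(candidates: int, removed: int):
    """Return (retained count, percentage of candidates removed, rounded to 0.1)."""
    retained = candidates - removed
    removed_pct = round(100.0 * removed / candidates, 1)
    return retained, removed_pct


summary = {sample: summarize(*counts) for sample, counts in COUNTS.items()}
for sample, (retained, removed_pct) in summary.items():
    print(f"{sample}: {retained} retained, {removed_pct}% removed")
```

The 69 candidates retained for G281 agree with the 69 read pairs discussed in the response to Reviewer #2, and only 5 candidates remain in the wild type.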
Comment 4:-Authors indicate that the sequences homologous to transgenic ones in the wild type "…must be derived from homogenous sequences between the host genome and the transformed plasmid, sequencing bias, or sequencing data contamination" -Lines 348 -355. If this approach is to be solid enough, authors need to be sure where these sequences came from rather than guess.
Answer: Thanks for your kind comment. In the new revision, we have listed all the 262 read pairs and described the details of where they came from in Table 2.

Reviewer #2 (Remarks to the Author):
Overall, the manuscript is in good shape. The language is adequate and easy to read. Although its novelty is not very strong, the authors demonstrated that PE-WGS is a useful tool for molecular characterization analysis of cisgenesis/intragenesis crops. This is valuable to the relevant scientific community. In order to make the conclusion more reasonable, the authors should make some revisions and explanations.
Comment 1: For 3.2, how were the 69 read pairs that were retained as candidate reads for determining the T-DNA insertion site distributed? How many were on the 3' junction region, and how many on the 5' junction region? How was it determined that the sporadic read pairs mapped to other chromosomes might be false positive reads? It should be proved by experiment.
Answer: Thanks for your kind comment. We have added the details of the 69 read pairs to Table 2 in the revised version. Among the 69 read pairs, 22 covered the 5' junction region, 40 covered the 3' junction region, and the other seven covered different loci and were considered, and verified, to be false positive reads. In our work the mean sequencing depth was 28.91, which means that each nucleotide should in theory be sequenced ~28.91 times, so a genuine junction is expected to be supported by many read pairs, as observed for the 5' junction (22 read pairs) and the 3' junction (40 read pairs). The other seven read pairs, however, mapped to seven different loci, so only one read pair was observed at each locus; we therefore speculated that those seven read pairs were false positives. We also designed primers according to the seven read pairs and verified them by PCR; the results showed that two read pairs have no amplification in both [...] we believed that no backbone sequence of the plasmid was inserted into the G281 line.
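The support-based reasoning in this answer (many read pairs at a genuine junction, a single read pair at each sporadic locus) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: the `sporadic_*` locus labels stand in for the seven loci listed in Table 2, and the `MIN_SUPPORT` threshold is a hypothetical cut-off that does not come from the manuscript.

```python
from collections import Counter

# Read-pair support per locus, using the counts reported in the answer above.
# The seven sporadic locus labels are placeholders for the loci listed in Table 2.
support = Counter({"5' junction": 22, "3' junction": 40})
for i in range(1, 8):
    support[f"sporadic_{i}"] = 1

MIN_SUPPORT = 3  # hypothetical threshold, not taken from the manuscript


def classify(counts: Counter, min_support: int) -> dict:
    """Label each locus by whether its read-pair support reaches the threshold."""
    return {
        locus: "junction candidate" if n >= min_support else "sporadic"
        for locus, n in counts.items()
    }


labels = classify(support, MIN_SUPPORT)
total_pairs = sum(support.values())
junctions = sorted(locus for locus, label in labels.items() if label == "junction candidate")
sporadic = sorted(locus for locus, label in labels.items() if label == "sporadic")

print("total read pairs:", total_pairs)
print("junction candidates:", junctions)
print("sporadic loci:", len(sporadic))
```

With a mean depth of about 29, any threshold of a few read pairs separates the two junctions from the singletons here; PCR verification, as the authors describe, is still needed to confirm the call.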
Comment 5: DNA extraction in 2.1 should be put at 2.2 part.
Answer: Thanks for your kind comment. We have revised it in the new MS.
Comment 6: please revise some mistakes such as:
(1) Line 116, xiushui-110, Line 117 xiushui 110, please unify throughout the text
Answer: Thanks for your kind comment. We have revised it in the new MS.
(2) Line 115, gene should be orthographic
Answer: Thanks for your kind comment. We have revised it in the new MS.
(3) Line 126, paired-end whole-genome sequencing should be deleted.
Answer: Thanks for your kind comment. We have revised it in the new MS.
(4) Line 170 hFL is hlf gene
Answer: Thanks for your kind comment. We have revised it in the new MS.
Answer: Thanks for your kind comment. The copy number of RNAi CYP81A6 was determined to be 2.00 (Table 2).

Reviewers' comments:
Reviewer #1 (Remarks to the Author):
Please see attached file
Reviewer #2 (Remarks to the Author):
The revised article has answered my question, and I think it meets the publication requirements.

General observation:
The authors have taken their time to revise the manuscript and clarify some of the issues raised in the first round of review. Authors added another component where they put the Gt1 gene sequences in the rice chromosome 5. They then did an analysis to prove that they can identify a cis gene in rice. Authors should highlight the assumptions of what they did which include: 1) During sequencing all the sequences will be as they envision 2) Only the inserted cis gene will be amplified during sequencing. The assumptions should be clearly highlighted in the manuscript. There are other corrections that need to be undertaken to enhance technical clarity of the manuscript. I have listed these per section below.

Novelty
The manuscript is not very high on novelty, as similar reports using PE-WGS have been published. However, the integration of different approaches, including droplet digital PCR, and the effort the authors took are commendable. Authors still face the challenges involved in picking up a cis gene in an endogenous genomic background. These challenges should be highlighted in the manuscript as actual challenges.

Abstract:
The abstract needs to be reviewed; lines 18-22 still front cisgenesis as the strength. Authors should check and have this rectified.
Line 18: Remove the words "novel" and "cisgenesis/intragenesis"

Materials and methods
Line 137: Did the authors mean to use "…and…" instead of "…or…"?
Lines 144-146: Authors need to indicate the platforms they used for the different aspects of the analysis. They could even create a flow diagram of these steps, e.g. index removal, trimming, etc.
Lines 148-150: What was the set of paired-end reads, and which region did it correspond to?
Line 149: Spiked into data? Which paired-end reads? What was the basis for using prior knowledge? If the authors didn't have this prior information, then these "false positives" would most likely be considered as part of the data.
Lines 236-243: The Gt1 promoter and CYP81A6 gene were present in the plasmid, in the transgenic line, and endogenously in the rice genome. Why would the authors consider this a false positive in the transgenic line when they are using controls in which it is present? It can only be a false positive if it was observed in the transgenic plant but was not expected.
Lines 247-248: Authors have mentioned that 7 'sporadic' sequences were observed in the 62 unique sequences. Authors have referred to these seven as "false positives"; I am not sure these hits are unexpected. Since the sequencing approach used here was not targeting the transgene, the argument that they are false positives may not be valid, since other sequences in the genome would be expected. Authors should highlight this as a shortcoming of their approach and give possible solutions.
Lines 247-255: The PCRs of the false positives indicate that they were present in both the negative control and the transgenic line. This therefore shows a weakness in their approach.
Line 275: replace "nien" with "nine"
Line 277: replace "…including" with "included"
Line 519: "detail" to be replaced with "detailed"
Authors should also thoroughly go through the manuscript and correct other small errors and omissions that could not be listed here.

Abstract:
The abstract needs to be reviewed; lines 18-22 still front cisgenesis as the strength. Authors should check and have this rectified.
Answer: Thanks for your comment. We have deleted the words of cisgenesis/intragenesis.

Materials and methods
Line 137: Did the authors mean to use "…and…" instead of "…or…"? [...] would most likely be considered as part of the data.

Answer: Thanks for your comment. We agree with your opinion. In this section, "false positive" is not the most accurate term. We have deleted the words "false positive" and rephrased the text throughout the MS. In the original TranSeq analysis, many more candidate reads are obtained, including those from homologous sequences of the endogenous genome, which significantly increases the difficulty of identifying the real, correct reads. With our modified TranSeq pipeline, however, reads from homologous sequences of the endogenous genome can be filtered out of the candidate reads, which significantly narrows the candidate set and improves its accuracy. In this study we did not know the structure of the transgene integration in the genome of the G281 line; we only had the whole sequence of the plasmid containing the T-DNA. Through BLASTN analysis, we learned that the Gt1 promoter and CYP81A6 gene in the T-DNA region of the plasmid were from the rice genome. We could then filter out the reads generated from the Gt1 promoter and CYP81A6 gene using the modified TranSeq pipeline, and obtain the more [...]
Lines 247-255: The PCRs of the false positive indicates that they were present in both negative control and transgenic. This therefore shows weakness in their approach.
Answer: Thanks for your comment. In whole genome resequencing analysis, sporadic reads often occur because of sequencing errors, chance events, noise sequences, and alignment jitter. This is a limitation of whole genome resequencing analysis. However, the number of sporadic reads is so far below the number of reads sequenced that it can be considered negligible. We have pointed out this limitation in the revised MS. (Line 260-263)
Line 275: replace "nien" with "nine"
Answer: Thanks for your comment. We have revised it. (Line 284)
Line 277: replace "…including" with "included"
Answer: Thanks for your comment. We have revised it. (Line 287)
Line 519: "detail" to be replaced with "detailed"
Answer: Thanks for your comment. We have revised it. (Line 532)
Authors should also thoroughly go through the manuscript and correct other small errors and omissions that could not be listed here.
Answer: Thanks for your comment. We have read and checked the whole MS to correct typing errors and omissions. All revised places are highlighted in red.

Figures and tables
Equally, authors mention that there were completely no sequences that were observed to hit the non-T-DNA region. However, figure 4 indicates some few alignments with the "Ori" region. Authors should explain this disparity. See below circled in red:
Answer: Thanks for your comment. We found that four reads and two reads mapped to the Ori region in WT and G281, respectively. However, considering the sequencing depth of ~29X, these sequences might come from the rice genome background or from sequencing contamination. Therefore, we believe that these sequences did not actually come from the plasmid backbone. We have also rephrased and discussed the results of the IGV analysis in the new MS. (Line 340-348)

I commend the efforts taken by the authors to improve the manuscript based on previous comments. The paper is now clear to me and I approve its publication. I have no additional comments.